AITopics | code unit

Collaborating Authors

code unit

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

UTF-8 Plumbing: Byte-level Tokenizers Unavoidably Enable LLMs to Generate Ill-formed UTF-8

Firestone, Preston, Ugare, Shubham, Singh, Gagandeep, Misailovic, Sasa

arXiv.org Artificial IntelligenceNov-11-2025

Subword tokenization segments input text according to a pre-defined vocabulary to feed it into a language model; the language model, in turn, generates a sequence made from this same vocabulary. The members of the vocabulary can be built of code points or bytes. Using code points means that all members of the vocabulary are valid UTF-8 characters. However, it also requires thousands of initial members to achieve acceptable coverage of inputs. Beginning with bytes, on the contrary, avoids out-of-vocabulary errors with only 256 initial members of the vocabulary, but the members of the vocabulary and sequences of them are not guaranteed to be valid UTF-8. Sequences that are not valid UTF-8 break code that assumes its input to be valid UTF-8. Applications of language models must account for the breakage thereby introduced. In this paper, we formalize tokenization using monoid theory and prove that tokenizers whose vocabularies contain tokens that are ill-formed UTF-8 can always produce sequences that are ill-formed UTF-8. We demonstrate formally that attempting to incrementally convert tokens back to a string and interpret the results as UTF-8 gives different results than converting the whole sequence of tokens at once. This formal result predicts real-world bugs: we evaluate mitigations for the problem identified and provide case studies of major foundation models, serving engines, and constrained generation systems.

large language model, logic & formal reasoning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2511.05578

Country:

Europe (1.00)
Asia (1.00)
North America > United States (0.68)

Genre: Research Report (0.52)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (0.93)

Add feedback

Project-Level C-to-Rust Translation via Synergistic Integration of Knowledge Graphs and Large Language Models

Yuan, Zhiqiang, Mao, Wenjun, Chen, Zhuo, Shang, Xiyue, Wang, Chong, Lou, Yiling, Peng, Xin

arXiv.org Artificial IntelligenceOct-14-2025

Translating C code into safe Rust is an effective way to ensure its memory safety. Compared to rule-based translation which produces Rust code that remains largely unsafe, LLM-based methods can generate more idiomatic and safer Rust code because LLMs have been trained on vast amount of human-written idiomatic code. Although promising, existing LLM-based methods still struggle with project-level C-to-Rust translation. They typically partition a C project into smaller units (\eg{} functions) based on call graphs and translate them bottom-up to resolve program dependencies. However, this bottom-up, unit-by-unit paradigm often fails to translate pointers due to the lack of a global perspective on their usage. To address this problem, we propose a novel C-Rust Pointer Knowledge Graph (KG) that enriches a code-dependency graph with two types of pointer semantics: (i) pointer-usage information which record global behaviors such as points-to flows and map lower-level struct usage to higher-level units; and (ii) Rust-oriented annotations which encode ownership, mutability, nullability, and lifetime. Synthesizing the \kg{} with LLMs, we further propose \ourtool{}, which implements a project-level C-to-Rust translation technique. In \ourtool{}, the \kg{} provides LLMs with comprehensive pointer semantics from a global perspective, thus guiding LLMs towards generating safe and idiomatic Rust code from a given C project. Our experiments show that \ourtool{} reduces unsafe usages in translated Rust by 99.9\% compared to both rule-based translation and traditional LLM-based rewriting, while achieving an average 29.3\% higher functional correctness than those fuzzing-enhanced LLM methods.

large language model, machine learning, translation, (19 more...)

arXiv.org Artificial Intelligence

2510.10956

Country: North America > United States > California (0.67)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Commenting Higher-level Code Unit: Full Code, Reduced Code, or Hierarchical Code Summarization

Sun, Weisong, Zhang, Yiran, Zhu, Jie, Wang, Zhihui, Fang, Chunrong, Zhang, Yonglong, Feng, Yebo, Huang, Jiangping, Wang, Xingya, Jin, Zhi, Liu, Yang

arXiv.org Artificial IntelligenceMar-13-2025

Commenting code is a crucial activity in software development, as it aids in facilitating future maintenance and updates. To enhance the efficiency of writing comments and reduce developers' workload, researchers has proposed various automated code summarization (ACS) techniques to automatically generate comments/summaries for given code units. However, these ACS techniques primarily focus on generating summaries for code units at the method level. There is a significant lack of research on summarizing higher-level code units, such as file-level and module-level code units, despite the fact that summaries of these higher-level code units are highly useful for quickly gaining a macro-level understanding of software components and architecture. To fill this gap, in this paper, we conduct a systematic study on how to use LLMs for commenting higher-level code units, including file level and module level. These higher-level units are significantly larger than method-level ones, which poses challenges in handling long code inputs within LLM constraints and maintaining efficiency. To address these issues, we explore various summarization strategies for ACS of higher-level code units, which can be divided into three types: full code summarization, reduced code summarization, and hierarchical code summarization. The experimental results suggest that for summarizing file-level code units, using the full code is the most effective approach, with reduced code serving as a cost-efficient alternative. However, for summarizing module-level code units, hierarchical code summarization becomes the most promising strategy. In addition, inspired by the research on method-level ACS, we also investigate using the LLM as an evaluator to evaluate the quality of summaries of higher-level code units. The experimental results demonstrate that the LLM's evaluation results strongly correlate with human evaluations.

code unit, module, summarization, (14 more...)

arXiv.org Artificial Intelligence

2503.10737

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Asia > China > Jiangsu Province > Nanjing (0.04)
Asia > Singapore > Central Region > Singapore (0.04)
(17 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.34)
Health & Medicine > Therapeutic Area > Immunology (0.34)
Health & Medicine > Epidemiology (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.99)

Add feedback

Advancing Neuromorphic Computing: Mixed-Signal Design Techniques Leveraging Brain Code Units and Fundamental Code Units

Isik, Murat, Miziev, Sols, Pawlak, Wiktoria, Howard, Newton

arXiv.org Artificial IntelligenceMar-18-2024

This paper introduces a groundbreaking digital neuromorphic architecture that innovatively integrates Brain Code Unit (BCU) and Fundamental Code Unit (FCU) using mixedsignal design methodologies. Leveraging open-source datasets and the latest advances in materials science, our research focuses on enhancing the computational efficiency, accuracy, and adaptability of neuromorphic systems. The core of our approach lies in harmonizing the precision and scalability of digital systems with the robustness and energy efficiency of analog processing. Through experimentation, we demonstrate the effectiveness of our system across various metrics. The BCU achieved an accuracy of 88.0% and a power efficiency of 20.0 GOP/s/W, while the FCU recorded an accuracy of 86.5% and a power efficiency of 18.5 GOP/s/W. Our mixed-signal design approach significantly improved latency and throughput, achieving a latency as low as 0.75 ms and throughput up to 213 TOP/s. These results firmly establish the potential of our architecture in neuromorphic computing, providing a solid foundation for future developments in this domain. Our study underscores the feasibility of mixedsignal neuromorphic systems and their promise in advancing the field, particularly in applications requiring high efficiency and adaptability

architecture, code unit, efficiency, (17 more...)

arXiv.org Artificial Intelligence

2403.11563

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
North America > United States > Rhode Island > Providence County > Providence (0.04)
North America > United States > California > Santa Clara County > Stanford (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)

Genre:

Overview (0.93)
Research Report (0.82)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Semiconductors & Electronics (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Architecture (0.94)
Information Technology > Data Science (0.68)

Add feedback

Discovering Program Topoi Through Clustering

Ieva, Carlo (Simula Research Laboratory) | Gotlieb, Arnaud (Simula Research Laboratory) | Kaci, Souhila (LIRMM, University of Montpellier) | Lazaar, Nadjib (LIRMM, University of Montpellier)

AAAI ConferencesFeb-8-2018

Understanding source code of large open-source software projects is very challenging when there is only little documentation. New developers face the task of classifying a huge number of files and functions without any help. This paper documents a novel approach to this problem, called FEAT, that automatically extracts topoi from source code by using hierarchical agglomerative clustering. Program topoi summarize the main capabilities of a software system by presenting to developers clustered lists of functions together with an index of their relevant words. The clustering method used in FEAT exploits a new hybrid distance which combines both textual and structural elements automatically extracted from source code and comments. The experimental evaluation of FEAT shows that this approach is suitable to understand open-source software projects of size approaching 2,000 functions and 150 files, which opens the door for its deployment in the open-source community.

feat, machine learning, natural language, (19 more...)

AAAI Conferences

Thirty-Second AAAI Conference on Artificial Intelligence

Country: Europe (0.68)

Genre:

Research Report (0.67)
Workflow (0.46)
Overview > Innovation (0.34)

Technology:

Information Technology > Software (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Developing Population Codes by Minimizing Description Length

Zemel, Richard S., Hinton, Geoffrey E.

Neural Information Processing SystemsDec-31-1994

The Minimum Description Length principle (MDL) can be used to train the hidden units of a neural network to extract a representation that is cheap to describe but nonetheless allows the input to be reconstructed accurately. We show how MDL can be used to develop highly redundant population codes. Each hidden unit has a location in a low-dimensional implicit space. If the hidden unit activities form a bump of a standard shape in this space, they can be cheaply encoded by the center ofthis bump. So the weights from the input units to the hidden units in an autoencoder are trained to make the activities form a standard bump.

algorithm, implicit coordinate, implicit space, (14 more...)

Neural Information Processing Systems

Country:

North America > Canada > Ontario > Toronto (0.15)
North America > United States > California > San Francisco County > San Francisco (0.14)
Asia > Singapore (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Computational Learning Theory > Minimum Complexity Machines (0.50)

Add feedback

Developing Population Codes by Minimizing Description Length

Zemel, Richard S., Hinton, Geoffrey E.

Neural Information Processing SystemsDec-31-1994

The Minimum Description Length principle (MDL) can be used to train the hidden units of a neural network to extract a representation thatis cheap to describe but nonetheless allows the input to be reconstructed accurately. We show how MDL can be used to develop highly redundant population codes. Each hidden unit has a location in a low-dimensional implicit space. If the hidden unit activities form a bump of a standard shape in this space, they can be cheaply encoded by the center ofthis bump. So the weights from the input units to the hidden units in an autoencoder are trained to make the activities form a standard bump.

artificial intelligence, implicit space, machine learning, (16 more...)

Neural Information Processing Systems

Country:

North America > Canada > Ontario > Toronto (0.15)
North America > United States > California > San Francisco County > San Francisco (0.14)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Computational Learning Theory > Minimum Complexity Machines (0.49)

Add feedback